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Zhengwu Zhang^*’*, Debdeep Pati^, Anuj Srivastava^* 
"Department of Statistics, Florida State University, FL, 32306 


Abstract 

Unsupervised clustering of curves according to their shapes is an important problem with broad scientific 
applications. The existing model-based clustering techniques either rely on simple probability models (e.g., 
Gaussian) that are not generally valid for shape analysis or assume the number of clusters. We develop 
an efficient Bayesian method to cluster curve data using an elastic shape metric that is based on joint reg¬ 
istration and comparison of shapes of curves. The elastic-inner product matrix obtained from the data is 
modeled using a Wishart distribution whose parameters are assigned carefully chosen prior distributions to 
allow for automatic inference on the number of clusters. Posterior is sampled through an efficient Markov 
chain Monte Carlo procedure based on the Chinese restaurant process to infer (1) the posterior distribu¬ 
tion on the number of clusters, and (2) clustering configuration of shapes. This method is demonstrated 
on a variety of synthetic data and real data examples on protein structure analysis, cell shape analysis in 
microscopy images, and clustering of shaped from MPEG7 database. 

Keywords: clustering; shapes of curves; Chinese restaurant process; Wishart distribution. 


1. Introduction 


The automated clustering of objects is an important area of research in unsupervised classification of 
large object databases. The general goal here is to choose groups (clusters) of objects so as to maximize 
homogeneity within clusters and minimize homogeneity across clusters. The clustering problem has been 
addressed by researchers in many disciplines. A few well-known methods are metric based e.g. K-means 
( MacQueen et al.[ 19671, hierarchical clustering (Ward 1963| ), clustering based on principal components, 
spectral clustering ( Ng et al.j 20021 and so on ( |Jain and Dubes [ 1988| Ozawa 1985| |. Traditional clustering 
methods are complemented by methods based on a probability model where one assumes a data generating 
distribution (e.g., Gaussian) and infers clustering configurations that maximize certain objective function 


( Banfield and Raftery}|l 993 [ [Fraley and Rafter^|1998[|2002[|2006||MacCullagh and Yang[[2008 1. A model- 
based clustering can be useful in addressing challenges posed by traditional clustering methods. This is 
because a probability model allows the number of clusters to be treated as a parameter in the model, and 
can be embedded in a Bayesian framework providing quantification of uncertainty in the number of clusters 
and clustering configurations. 

A popular probability model is obtained by considering that the population of interest consists of K dif¬ 
ferent sub-populations and the density of the observation y from the sub-population is f^. Given obser¬ 
vations yi,..., yn, we introduce indicator random variables (ci,..., Cn) such that a = k if yi comes from 
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the A:th sub-population. The maximum likelihood inference is based on finding the value of (c, /i,..., /fc) 
that maximizes the likelihood 0^=1 fci (?/*)■ Typically K is assumed to be known or a suitable upper bound 
is assumed for convenience. When yi G fk is commonly parametrized by a multivariate Gaussian 
density with mean vector and covariance matrix S^.. An alternative is to use a nonparametric Bayesian 
approach which has an appealing advantage of allowing K to be unknown and inferring it from the data. 
An advantage of such an approach is that it not only provides an estimate of the number of clusters, but also 
the entire posterior distribution. 

The vast majority of the literature on model-based clustering is almost exclusively focused on Euclidean 
data. This is primarily due to the easy availability of parametric distributions on the Euclidean space as well 
as computational tractability of estimating the cluster centers. Eor clustering functional data, e.g. shapes 
of curves, one encounters several challenges. Unlike Euclidean data, where the notions of cluster centers 
and cluster variance are standard, these quantities and the resulting quantification of homogeneity within 
clusters are not obvious for shape spaces. Moreover, it is important to use representations and metrics 
for clustering objects that are invariant to shape-preserving transformations (rigid motions, scaling, and re- 


parametrization). Eor example, Kurtek et al. (2012 1 takes a model-based approach for clustering of curves 


using an elastic metric that has proper invariances. However, under the chosen representations and metrics, 
even simple summary statistics of the observed data are difficult to compute. Other existing shapes cluster¬ 
ing methods ( [Belongie et al.[ 2002 Eiu et al.[ 20121 either extract finite-dimensional features to represent 
the shapes or project the high-dimensional shape space to a low-dimensional space ( Yankov and Keo^ 
2006 Auder and Eischer] 20121, and then apply clustering methods for Euclidean data; these approaches 
are not necessarily invariant to shape preserving transformations. Also, several methods ( Srivastava et al.[ 
2005 1 Gaffney and Smyth) [2005 I have been proposed to cluster non-Euclidean data based on a distance- 
based notion of dispersion, thus, avoiding the computation of shape means (e.g. Karcher means), but they 
all assume a given number of clusters. 

In this paper we develop a model-based clustering method for curve data that does not require the 
knowledge of cluster number K apriori. This approach is based on modeling a summary statistic that 
encodes the clustering information, namely the inner product matrix. The salient points of this approach are: 
(1) The comparison of curves is based on the inner product matrix under elastic shape analysis, so that the 
analysis is invariant to all desired shape-preserving transformations. (2) The inner product matrix is modeled 
using a Wishart distribution with prior on the clustering configurations induced by the Chinese restaurant 


process ( |Vogt et al.[ |2010[ ). A model directly on the inner-product matrix has an appealing advantage of 
reducing computational cost substantially by avoiding computation of the Karcher means. (3) We formulate 
and sample from a posterior on the number of clusters, and use the mode of this distribution for final 
clustering. We illusfrale our ideas fhrough several synlhefic and real dafa examples. The resulfs show fhaf 
our model on fhe inner producf mafrix leads fo a more accurafe esfimafe of fhe number of clusfers as well 
as fhe clustering configurations compared fo a Bayesian nonparamefric model direcfly on the data, even in 
the Euclidean case. 

This paper is organized as follows. We start by introducing two case studies in Section The math¬ 
ematical details of the metric used for computing the inner product and and the model specifications are 
presented in Section In Section we illustrate our methodology on several synthetic data examples 
and the case studies on clustering cell shapes and protein structures. Section [^closes the paper with some 
conclusions. 


2. Case studies 

We propose to undertake two specific case studies involving clustering of curve data. 
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Figure 1: Protein sequences, (a) raw protein structure data in (b) 3-dimensional components of the protein sequences, where 
the x-axis indicates the length of each sequence. 


2.1. Clustering of protein sequences 

Protein structure analysis is an outstanding scientific problem in structural biology. A large number 
of new proteins are regularly discovered and scientists are interested in learning about their functions in 
larger biological systems. Since protein functions are closely related to their folding patterns and structures 
in native states, the task of structural analysis of proteins becomes important. In terms of evolutionary 
origins, proteins with similar structures are considered to have common evolutionary origin. The Structural 
Classification of Proteins (SCOP) database (Murzin et al. 1995| l provides a manual classification of protein 
structural domains based on similarities of their structures and amino acid sequences. Refer to Fig. [T]for a 
snapshot of the proteins in and the 3-coordinates of the protein sequences. Clustering protein sequences 
is extremely important to trace the evolutionary relationship between proteins and for detecting conserved 
structural motifs. In this article, we focus on an automated clustering of protein sequences based on their 
global structures. 

2.2. Clustering of cell shapes 

The problem of studying shapes of cellular structures using microscopic image data is very important 
medical diagnosis ( Rohde et al.[ 20081 and genetic engineering ( [Thomas et al.[[2002 l. This research in¬ 
volves extracting cell contours from images using segmentation techniques ( Hagwood et al.[ 2012] ) and 
then studying shapes of these extracted contours for medical diagnosis. We will focus on the problem of 
clustering of cells according to their shapes; these clusters can be further used for statistical modeling and 
hypothesis testing although these steps are not pursued in the current paper. The specific database used here 
was obtained by segmenting the 2D microscopy images, as described in Hagwood et al. (2012 2013| ). Fig. 
(b) shows some examples of the cell contours used in this paper. In this article, we consider two types of cell 
shapes: DLEX-p46 cell shapes and NIH-3T3 cell shapes. Visually, DLEX-P46 cells are round, denoting 
normal cell shapes whereas, NIH-3T3 cells have an elongated, spindly appearance, denoting progression of 
some pathological conditions. 


3. Methodology 

Previous methods of clustering non-Euclidean objects can be broadly categorized into two parts: (1) 
clustering based on representation of the data in an infinite-dimensional quotient space under a chosen Rie- 
mannian metric and (2) clustering based on suitable summary statistic of the data e.g. distance matrices. 
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Figure 2: Example shapes, (a) example shapes in the MPEG-7 dataset; (b) example cell shapes. 


Any representation in the infinite-dimensional space involves the calculation of the mean and the covariance 


mafrix (Kurfek ef al.| 2012 

Tucker ef al. 

2013 

I which is compufafionally expensive. To avoid calculafing 

fhe mean and fhe covariance mafrix. 

Srivasfava ef al. 

20051 developed a mefhod for clusfering funcfional 


data based on pairwise distance matrix and resorted to stochastic simulated annealing for fast implementa¬ 
tion. Although the method is quite efficient, one requires the knowledge of the number of clusters apriori. 
In this paper, we develop a model based on the Wishart distribution for the inner product matrices to 


Vogf ef all 2010 

instead of fhe infinite dimensional dafa poinfs, our mefhod is compufafionally efficienf. 

However, unlike 

Adamefz and Rofh 2011 

Vogf el al. 

2010 

1 which consider a slandard melric fo cal- 


culate the distance matrices, the inner product matrix is calculated using a specific represenfafion of curves 
called square-root velocity function (SRVF) ( [Srivasfava ef al. 20111. This along wifh some regisfrafion 
fechniques make fhe inner producf invarianf fo fhe shape preserving fransformafions, fhus eliminating fhe 
drawback of ( Adamefz and Rofh]|2011t Vogf ef ah] 20101. Moreover, a Bayesian nonparamefric approach 
allows us fo do aufomafic inference on fhe number of clusfers. 

Below, we describe fhe mafhemafical framework for computing fhe inner producf mafrix. 


3.1. Inner product matrix using elastic shape analysis 


We adapf fhe elastic shape analysis infroduced in Srivasfava ef al. (20111 fo calculafe fhe inner producf 
mafrix in fhe square-root velocity function (SRVF) space for fhe non-Euclidean functional dafa. Lef /3 : 

—)• RP be a parameferized curve in wifh domain D. We resfricf our affenfion fo fhose j5 which are 
absolufely continuous on D. Usually D = [0,1] for open curves and D = for closed curves. Define F = 
{/3 : : /3 is absolufely confinuous on D] and a confinuous mapping: Q : —)• as 




if \x\ / 0 

ofherwise. 


Here, | • | is fhe Euclidean 2-norm in M^. Eor fhe purpose of sfudying fhe shape of a curve (3, we will 
represenf if as: q : D —)> where q{t) = Q{${t)). The function q : D —)• is called square-root 
velocity function (SRVE). If can be shown fhaf for any (3 ^ F, fhe resulting SRVE is square infegrable. 
Hence, we will define W) fo be fhe sef of all SRVEs. Eor every q G fhere exisfs a curve 

13 (unique up fo a consfanf, or a franslafion) such fhaf fhe given q is fhe SRVE of fhaf j3. 
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There are several motivations for using SRVF for functional data analysis. First, an elastic metric be¬ 
comes the standard metric under the SRVF representation (Srivastava et al. 20111. This elastic metric is 


invariant to the re-parameterization of curves and provides nice physical interpretations. Although the orig¬ 
inal elastic metric has a complicated expression, the SRVF transforms it into the metric, thus providing 
a substantial simplification in terms of computing the metric. 

By representing a parameterized curve /3 by its SRVF q, we have taken care of the translation variability, 
but the scaling, rotation and the re-parameterization variabilities still remain. In some applications like 
clustering of protein sequences, it is not advisable to remove the scaling variabilities as the length can 
be a predictor of its biological functions. On the contrary, in applications like clustering images with the 
camera placed at variable distances, it is necessary to remove the scales by rescaling all curves to be of 
unit length, i.e., \q{t)\‘^dt = 1. The set of all SRVFs representing unit-length curves 

is a unit hypersphere in the Hilbert manifold L^(Z), M^). We will use C° to denote this hypersphere, i.e., 
C° = {q G M^)| fj^ \q{t)\^dt = 1}. A rigid rotation in is represented as an element of SO{p), the 

special orthogonal group of p X p matrices. The rotation action is defined to be SO{p)xC° —)■ C° as follows. 
If a curve is rotated by a rotation matrix O G SO{p), then its SRVF is also rotated by the same matrix, i.e. 
the SRVF of 0/3{t) is Oq{t), where q is the SRVF of / 3 . A re-parameterization function is an element of 
r, the set of all orientation-preserving diffeomorphisms of D. For any (d € T and 7 G F, the composition 
/3 o 7 denotes the re-parameterization of fd by 7. The SRVF of /3 o 7 is given by: q{t) = We 

will use {q, 7) to denote g(7(f))\/7 in the following. 

It is easy to show that the actions of SO{p) and F on C° commute each other, thus we can form a join 
action of the product group SO{p) x F on C° according to ((0, 7 ), q) = 0{q o The action of the 

product group F x SO{p) is by isometries under the chosen Riemannian metric. The orbit of an SRVF 
G C" is the set of SRVFs associated with all the reparameterizations and rotations of a given curve and is 
given by: [q] = closureKq, ( 0 , 7 ))|( 0 , 7 ) G SO{p) x F}. The specification of orbits is important because 
each orbit uniquely represents a shape and, therefore, analyzing the shapes is equivalent to the analysis of 
orbits. The set of all such orbits is denoted by S and termed the shape space. S is actually a quotient space 
given by 5 = C"/(S'O(p) x F). Now we can define an inner producf on fhe space S which is invarianf fo 
franslafion, scaling, rofafion and reparameferizafion of curves. 

Definition 1. ( inner product on shape space of curves ). For given curves /3i, /32 G and the corresponding 
SRVFs, qi,q 2 , we define the inner product, or ([qi], [q2])e> 

■S/3 i,/32 = sup (qi, (q2, ( 0 , 7 ))) • 

7er,0e50(d) 

Nofe fhaf fhis inner producf is well-defined because fhe acfion on SO{p) x F is by isomefries. 

Optimization over SO{p) and F: The maximization over SO{p) and F can be performed iteratively as 
in ( Srivastava et al.[[2011| ). In our case, we use Dynamic Programming algorithm to solve for an optimal 7 
first. Then we fix 7, and search for the optimal rotation O in SO{p) using a rotational Procrustes algorithm 


(Kurtek et al. 20121. 


3.2. Likelihood specification for the inner product matrix 

Let 5+n (M) denote the set of all n x n symmetric non-negative definite matrices over M. Depending on 
whether we rescale the curves to have unit length or not, we define two classes of inner product matrices: 
i) = {A G S'+n(M) : an = l,\aij\ < 1,1 < i j < n} and ii) 5+n(M). In this article, we 

do not make a distinction between these two cases and specify our model for the larger subspace S+n (1^) 
irrespective of whether we rescale the curves or not. As illustrated using the experimental results in Section 
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1^ having a probability model on a slightly larger space does not pose any practical issues when we actually 
rescale the curves. 

For a scaled inner product matrix S G S+n(^), let S ~ Wn{'^, d), the Wishart distribution with degrees 
of freedom d and parameter S = E(S') of rank n(d > n). To allow rank-deficient S, a generalized Wishart 
distribution with degrees of freedom d{d < n) can be defined as 


p(S I E,ii)o< 




- ^tr(S-i5) 


( 1 ) 


where I'l implies the product of non-zero eigenvalues and tr(-) is the sum of the diagonal elements. For an 
observation of S, the log-likelihood function is 


/(S;5,d)(x-^log(|S|)-^tr(S-i5) 


( 2 ) 


for S G 5+n(M). One can easily identify this as an exponential family distribution with canonical parameter 
W = and the deviance is minimized at S = 5 (McCullagh 20091. Therefore, E encodes the 
similarity between the observed shapes measured by the inner product matrix S. For instance, T,jk encodes 
the similarity between yi and yj as measured by the inner product ([(/j], where qi and qj are the 

SRVFs of yi and yj, respectively. 

Clustering is equivalent to finding an opfimal partition of fhe dafa. We use P = {Pi,P 2 , ■ ■ ■, Pr] F 
V fo denofe a partition of sef {l,2,...,n} info K classes, where V denotes fhe sef of all partitions of 
{1,2,... ,n}. A partition P can also be represented by membership indicators {ci,i = 1,..., n}, where 


= j if i G Pj,j = 1,..., iC, or a membership matrix B G 


defined as Bij = 


1 if Cj = Cj 

0 ofherwise. 

If we assume: (1) observed shapes {yi G P,i = 1,... ,n} come from several sub-populations, and (2) 
observed shapes from fhe same populafion are placed nexf fo each ofher; one would expecf fo observe a 
block pattern in fhe inner producf mafrix S because fhe observations from fhe same clusfer will have similar 
inner producf. Fig. [^on fhe leff panel shows one example of such inner producf mafrix, which is calculafed 
from simulafed Euclidean dafa wifh fhree clusters. One can observe fhree large-value-blocks along fhe 
diagonal. 

To perform Bayesian inference on fhe clustering configurafions, we define fhe following prior on S fhaf 
enables clustering of fhe observafions. Mofivafed by ( MacCullagh and Yang[ 2008 1 Adamefz and Rofh 
201 HlWgf ef al.[ 20101, consider fhe following decomposition of S. Lef 


E = aJ + /3S , 


( 3 ) 


where a, (3 G M, I is fhe identify mafrix and B G is fhe membership mafrix. Equation Q decom¬ 

poses fhe scalar mafrix E info a sparse mafrix al and a low-rank mafrix j5B, where B encodes fhe clusfering 
informafion. Eor convenience of infroducing a conjugate prior for a ( |Vogf ef ar 20101, we re-parameferize 
Ibis model info T, = a{I + 9B), where 9 = (3/a. Infuifively, fhe parameter 9 confrols fhe sfrengfh of sim¬ 
ilarity befween fwo observations measured by fheir inner producf - a large 9 indicafes a sfrong associafion, 
and vise versa. Refer fo Eig. for an illusfrafion of fhe membership mafrix B and fhe corresponding E 
mafrix. 

Our primary goal is fo develop a Bayesian approach fo infer fhe posterior disfribufion on fhe membership 
mafrix B. To fhaf end, we denofe fhe likelihood by p(5| E, d), and we place priors on E and d, fhe unknown 
paramefers in fhe likelihood. The prior on E is induced by firsf letting T, = a{I + 9B) and fhen placing 
priors on a, 9 and B. These consfifufe our Bayesian model for S, fhe inner-producf mafrix. Below, we 
discuss fhe specificafion of prior disfribufions for a, 9, B and d. 
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Figure 3: From the left to right: an inner product matrix S, a partition matrix B and a scale matrix S. 


3.3. Priors and hyperpriors 

A popular method of inducing a prior distribution on the space of partitions V is the Chinese restaurant 
process (CRP) ( Pitman[ [2006 1 induced by a Dirichlet process ( [Fergusori] 1973[ 1974). Since a prior on 
{cj, i = 1,n} induces a prior on V and, hence, on the space of membership matrices B, it is enough to 
specify a prior on {cj, t = 1, ...,n}. We assume 


P{Cn = j \ Cl,..., Cn-l) = ” 1+5 

I n-1+5 


if Cn = j for some 1 < j < K 
otherwise. 


(4) 


C+i-l 


where nj = #{i : 1 < i < n,Ci = j} and ^ > 0 is the precision parameter which controls the prior 
probability of introducing new clusters. The expected cluster size under CRP is given by Yl^=i —~ 

eiog(^). 

3.3.1. Hyperpriors 

We need to choose hyperpriors for parameters associated with the prior distributions. 


Priors on a and 0 : a is assigned an inverse Gamma distribution, denoted a ~ Inv-Gamma(r, s) for 
constants r, s > 0. An inverse Gamma distribution for a allows us to marginalize out a in the posterior dis¬ 
tribution, thus obviating the need to sample from its conditional posterior distribution in the Gibbs sampler 


(refer to Section 3.3.2). Recall that 6 controls the strength of similarity within cluster. Thus a large 6 will 


encourage tight clusters (elements in each cluster are very similar). We will explore the sensitivity of the 
final clustering to 6 in Section|^ We assume a discrete uniform distribution for 9 on the set {6i,..., 9m}, 
with P{9 = 6»i) = ^, i = 1, ...,m. 

Choice of ^ and d: Recall that the ^ controls the prior probability of introduction of new clusters in the 
CRP Q. We start with an initial guess of the number of clusters Cq using standard algorithms for shape 
clustering ( [Yankov and Keo^ 2006 Auder and Fischer 2012). In our experience, Co/logn provides 
reasonable choice for 

Also, recall that d is the degrees of freedom for the Wishart distribution. Since d represents the rank of 
the inner product matrix S, it is natural to estimate d using the number of largest eigenvalues of S which 
explains 95% of the total variation. This forms an empirical Bayes estimate of d, denoted dEB- Let the 
eigenvalues of S to be {Ai,..., Am}, where m < n and Ai > A2 > • •. > Am- dEB is taken to be the 

smallest integer such that ^ 


1 


> 0.95. 
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3.3.2. Posterior computation and final selection of clusters 

Next, we develop a Gibbs sampling algorithm to sample from the posterior distribution of the unknown 
parameters. To that end, we propose the following simplifications to the likelihood. The trace and determi¬ 
nant that involve the a and 9 in equation ([T]) can be computed analytically (Vogt et al. 2010 Adametz and 


Roth 20111. Observe that 


|S-i| =a-'^[](l + 0n,)-' , 
i=i 


( 5 ) 


where nj is the number of elements in j*" cluster. Clearly, the cluster corresponds to the j*" diagonal 
block in refer to Fig. |^a). Let Sjj,j = 1,... J be a sub-square-matrix in S corresponding to the 


diagonal block in B, and Sjj = IjS'Ij. Let Ij G be such that the element is l(ci = j) for 

i = 1,..., n. Then 


"(S-S) = E - 1 


i=i 






Substituting (|^ and ([^ in ([T]) with ([^, we obtain. 


P{S I B, a, 9, d) oc 

J 


a 


—ndl2 


(1 + exp - — <1 tr(5) - ^ 


i=i 


i=i 




( 6 ) 


(V) 


If a ~ Inv-Gamma(ro(i/2, sod/2) it is possible to integrate out a analytically in ([t]) as P{S \ B, 9, d) = 
f P{a)P{S I B,a,9,d)da yielding 


P{S \ B, 9, d)^l[il + _ L ^ ^—^ 4 . + so 


i=i 


i=i 


-{n+ro)d/2 


( 8 ) 


Using the prior distributions for 9, B with the dEB plugged in the likelihood Q, we get the posterior 
distribution of the membership matrix B: 


P{B\S, 9, d, 0 (X P{S\B, 9, dEB)P{B\0P{9). (9) 

Fig. 1^ shows the graphical model representation of our Bayesian model. Some suggestions of specifying 
hyper-priors are summarized in Table [T] We use Markov chain Monte Carlo (MCMC) algorithm to obtain 
posterior samples B^^\ ..., B^^'l for a suitable large integer M > 0 using ([^. The detailed algorithm is 
described in the following. 


Algorithm 1. Posterior sampling using the MCMC 

Given the prior parameters dEB> f, s, 9, and the inner product matrix S from the n observations, and let 
Ng = length{9), we want to sample Iter number of posterior samples of membership matrices B: 

1. Initialize the cluster number K (a large integer), the cluster indices {cj, i = 1,..., n} and obtain the 
initial membership matrix 

2. For each sweep of the MCMC (it = t)to Iter) 
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Table 1: Suggestions for specifying hyper-priors in our model. 


Hyper-parameters 

Deseription 

Suggested values 

e 

Parameter for S, 

S = a{I + eB) 

9 ~ Uniform(6*i,..., On) 
large 9 - tight elusters, small 9 - loose elusters 

a 

Parameter for S 

a ~ Inv-Gamma(r, s), where r, s are eonstants 

d 

Degrees of freedom of Wishart 

Estimated using the rank of S 


Parameter for CRP 

^ = K/ log(n), K is initial estimated # of elusters 



Figure 4: Graphical model representation of our Bayesian model. Squares indicate fix parameters and circles indicate random 
variables. 
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(a) 


(b) 


(c) 


For each 9i,i = 1,..., iV^ , obtain posteriors P{0i\-) oc P{S\9i, , dEB)p{(^i) using 

Normalize {P{9i\-)} and sample 9^^ from the discrete distribution on the support points 9i,i = 
1,..., N 0 with probabilities {P(0i|-) ,..., P(0Arg|-)}. The complexity for this step is 0{Ng * 
k^(it)), where k^^u) is the number of clusters obtained from B^P. 

For each observation (i = 1 to n) 
i. For each cluster (j = 1 to kQ(it) + 1) 

A. Assign current observation (yi) to the j-th cluster, update the membership matrix B^P 
to B'^p and calculate the posterior iTj = P{B^^\S, 9^^,dEB,C) using Q. The complex¬ 
ity for this step w 0(1)[^ 


ii. Normalize {vri, and sample Cifrom a discrete distribution on support points 
{1,2,..., k^^it) + 1} with probabilities (tti, ..., Update Sb*), Complexity for 

this step is 0{log(fe^(it))} ij^ringmann andPanagiotm 2012 ). 


After completing Step 2b we obtain one MCMC sample of 


3. Repeat step^so that we have Iter many samples. Discard the first few samples (bum-in), and relabel 
the remaining as B^^\ ..., B^^X 


From algorithmic the complexity of each sweep of the MCMC is O{N 0 K + niFlogiF}. Usually 
Nq < n, leading to an overall complexity of 0{nK log(iF)). 

Once we obtain the posterior samples i = 1,..., M}, our goal is to estimate the clustering 

configuration. However, the space of membership matrices B is huge, and we would expect the posterior to 
explore only an insignificant fraction of the space based on a moderate values of M. Therefore, instead of 
using the mode of (Hb), f = 1,..., M}, we devise the following alternate strategy to estimate the clustering 
configuration more accurately. We treat the set of the membership matrices, denoted as Bb, as a subset of 
symmetric n x n matrices with restrictions: (1) B(i,j) = {0,1} for all i,j = l,...,n; i2)B{i,-) = B{j,-) 
and B{-,i) = B{-, j) ii observation and observation are in the same cluster. The final mafrix B* is 
obfained by calculating fhe extrinsic mean of fhe posferior samples defined as follows. 


Algorithm 2. Calculating extrinsic mean of membership matrices 

Given the samples B^^\ ..., B^^^X the extrinsic mean B* is calculated as the following: 

1. Find the mode of the number of clusters ko based on the samples B^^'^, ■ ■ ■, B^^X 

2. Calculate the Euclidean mean and threshold it onto the set of membership matrices (Tb)■' 

(a) Euclidean mean.- Let B = ^ 

(b) Thresholding.- threshold the Euclidean mean onto Tb- B* = threshold{B, t*), where t* is the 
largest threshold such that B* has fco clusters. Setting k = N and iter = M, the thresholding 
procedure is described below: 

While (k ko), do 

i. Set Jarray = {1, N}, B* = zeros{N, N). Also set iter = iter — 1, let t* = iter/M. 

ii. For j in JcLj'Y'CLyy calculate 

A. V = l{B{j, •) > t*); record the index of elements in v equal to 1, denoted as set C. 
Let Jarray = Jarray ~ C, which means remove elements in C from Jarray 

B. For i in set C, set B*{i, •) = v, B*(-,i) = v* and B{i, •) = 0, B{-, i) = 0*. 


'since only one observation changes the cluster index, one can explicitly calculate the difference between the old values of 0 
and 0 and new values in 0(1) steps. 
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(a) (b) (c) 

Figure 5: Posterior distribution of clustering number k. 


iii. Set k = H^B*, which is number of clusters in B*. 

Fig. shows some generic illustrations of the posterior distribution on the number of clusters obtained 
from B^^\ ..., B^^\ One may notice that the Euclidean mean B ^ IFb. Actually B{i,j) represents the 
posterior probability that the and the observations are clustered together. To project B into Fb, we 
find the largest value t* to threshold B, such that the thresholded B, denoted as B*, has ko clusters, and 
B* G Fb- In other words, we assign two data points to the same cluster if the posterior probability of being 
clustered together is greater than or equal to t*. It is rare that we can not find a t* fo fhreshold B fo obfain 
B* £ Fb wifh ko clusfers. In fhis circumsfance, one can eifher sample more posferiors, or re-sample fhem. 


4. Experimental Results 

In this section, we demonstrate the performance of our model (Wishart-CRP, denoted by W-CRP) both 
on synthetic data (in Section |4T| ) and the case studies (in Section |4^ introduced earlier in Section For 
the Euclidean datasets, we generated 8000 samples from the posterior distribution and discarded a burn-in 
of 1000, whereas those numbers for the non-Euclidean data are 4000 and 1000, respectively. Convergence 
was monitored using trace plots of the deviance as well as several parameters. The high effective sample 
size of the main parameters of interest shows good mixing of the Markov chain. Also we get essentially 
identical posterior modes with different starting points and moderate changes to hyperparameters. 

4.1. Synthetic Examples 

We consider several simulation settings for both Euclidean and non-Euclidean datasets. 


4.1.1. Euclidean Data 

Elicitation of hyperpriors: 

In the Euclidean case, the data are generated from a Gaussian mixture model p{x \ A) = tVig{x \ 
Pi,Tji),x £ M^, with the mixing weights Wi, i = 1,2,3 and the component-specific Gaussian densities 
g{x I p,i,T,i), i = 1,2,3. In the first experiment, we perform a sensitivity analysis to the choices of 
hyperpriors 9 and ^ in the W-CRP formulation. Recall from Section 3.3.1 that (1) 9 controls the strength 
of association between elements within clusters; (2) ^ is the concentration parameter for the CRP which 
controls the prior probability of producing new clusters. In this experiment, we test sensitivity to 9 and 
keeping all the remaining parameters fixed {dEB = 2, r = 3, s = 4). Fig. shows the clustering 
results on three different 2D Euclidean datasets (each dataset contains 60 data points) with different 0’s and 
^’s. Observations in Dataset 1 are clearly separable into three classes; Dataset 2 contains observations that 
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can either be clustered into two or three classes; and Dataset 3 contains observation that shows no clear 
congregation of observations. 

The histograms in Fig. show posterior distributions of cluster number K. In the upper left panel, 
results are provided for ^ = 0.2 and 9 G {0.1, 0.2, 0.3,0.4,0.5}. In the upper right panel, we use ^ = 10 
and the same 9 G (0.1, 0.2,0.3,0.4, 0.5}. Clearly, when ^ = 10, we have more clusters with small sizes, 
although the dominant clusters appear to be similar to the case of ^ = 0.2. In the lower left panel, we set 
^ = 0.2 and 9 G {1000, 2000, 3000,4000, 5000}. Comparing this with the result in the upper left panel, 
we can see that larger 9 values lead to more but tighter clusters. This conforms to our intuition about the 
role of 9. In the lower right panel, we used large ^ = 10 and large 9 G {1000, 2000, 3000, 4000, 5000}. 
Comparing this with the one in upper right panel, we find that when 9 is large, the estimate of the number 
of clusters is less sensitive to ^ than with a smaller value of 9. 

The next experiment presents a detailed sensitivity analysis to the choices of ^ and 9. First, we fix the 
prior on 9 (= {100, 200,300,400,500}) and investigate the relation between the inferred cluster number 
K versus the choice of Fig. [7] shows the result. The upper x-axis is ^ and bottom x-axis corresponds to 
^ X log(n), where n = 60 is the number of observations. One can see that, the inferred cluster number is 
robust to the choice of ^ in a large range. After ^ exceeds certain value, the inferred K increases. This is rea¬ 
sonable because a large ^ (in CRP) produces a prior with large probability to induce new clusters, and it dom¬ 
inates the posterior inference. Typically, one does not choose such a large ^ in most practical situations, un¬ 
less there is strong prior evidence. Next, we study the sensitivity of the relation between the inferred K and ^ 
to the choice of 9. To address this, for ^ G {0.1,0.2, 0.4, 0.8,1, 2, 5,10,12,15}, we specify priors on 9 in the 
following way: 9 = Ax{0.1,0.2, 0.3, 0.4,0.5}, where A = {1, 5,10, 20, 50,100, 200, 500,1000, 5000,10000}. 
Fig. shows a heat-map of the matrix of the estimated K for different values of ^ (across the rows) and 
A (across the columns). This figure shows (1) The posterior of cluster number K is robust to the choice of 
^ and 0 in a large range. (2) For a fixed a small 9 tends to produce more clusters and a large 9 tends to 
produce less clusters. These results are coherent with the roles of ^ and 9: (1) ^ controls the probability 
of introducing new clusters. Hence, for a very large the posterior tends to have more smaller extraneous 
clusters. We recommend choosing ^ = K/ (log(n)), TF is a preliminary estimate of the number of clusters. 
Such a choice does not overestimate the value of ^ and produces less extraneous clusters. (2) 9 controls the 
strength of association between elements within clusters. A large 9 indicates a strong association, and tends 
to have more and tighter classes. However, 9 tends to have a weak influence on the clustering result if ^ is 
not very large. 

In a very recent technical report (Miller and Harrison 20131, it is shown that Dirichlet process mixture 
(DPM) of Gaussians leads to an inconsistent estimation of the number of clusters. Instead, mixture of finite 
mixtures model (MFM) (Miller and Harrison 20131 can guarantee the consistent estimation of the number 
of clusters. In next experiment, we compare with mixture of finite mixtures (MFM) prior on the same data as 
in Fig. For a given finite mixture of Gaussians with number of components K, the MFM model assumes 
a Dirichlet distribution, Dir(Q;,. .., a) conditionally on K which assigned a Poisson(A) distribution. Fig. 

shows the clustering results with MFM prior, where we set A = 2. The left panel shows results using 
9 G {0.1, 0.2,0.3,0.4, 0.5} and the right panel shows results using 9 G {1000, 2000, 3000,4000, 5000}. 

The clustering results of CRP and MFM are very similar. 

For comparisons in the Euclidean case, we use the Dirichlet process mixture (DPM) of Gaussians 
(Heller and Ghahramani 20051 and MFM of Gaussians directly on the observations yi,i = I,... ,n. Note 
that, unlike in the non-Euclidean case, the mean and covariance matrix can be efficiently obtained in the 
Euclidean case. Eig. [^shows the results for the same dataset as in Eig. The left panel shows the results 
with DPM of Gaussians and the right panel shows the results with MEM of Gaussians. One can observe 
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Figure 6: Euclidean data clustering results under different priors on 9 and 
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Figure 7: Estimated number of clusters for different ^ when we use a fix prior on 0 (= {100, 200, 300,400, 500}). 



Figure 8: Estimated number of inferred clusters for different A and The columns of the heat-map matrix are cor¬ 
responding to A (A = {1,5,10,20,50,100,200,500,1000,5000,10000}) and the rows are corresponding to ^ = 

{0.1,0.2, 0.4, 0.8,1, 2, 5,10,12,15}). 
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Figure 9: Euclidean data clustering results using our model with MEM prior. We use A = 2 and different 8 in this experiment. The 
left part results are using 6 G {0.1, 0.2,0.3,0.4,0.5}, and the right part results are using 6 G {1000, 2000, 3000, 4000, 5000}. 


from Fig. [T^that both the methods tend to produce extraneous clusters compared to W-CRP. This confirms 
the inconsistency of DPM as in ( Miller and Harrison} 20131 and also demonstrates that the convergence 
of MFM is slow, although theoretically it might produce consistent estimates of the number of clusters 
asymptotically. 

Non-Euclidean Shape Data: 

In this experiment, we study shapes taken from the MPEG-7 database ( Jeannin and Bob^ 19991. The 
full database has 1400 shape samples, 20 shapes for each class. We first choose 100 shapes to form a 
subset of 10 classes with 10 shapes from each class. The observations are randomly permuted and the inner 
product matrix S is calculated using Definition [T] Then we perform our clustering method on S (note that 
S G [/+(M)). We impose a prior on a with Inv-Gamma(3,4), and is estimated by 7F/log(100), where K 
is an estimate of number of clusters {K = 15 in this case). The clustering result is shown in Fig. 
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where 

(a) and (b) shows the inner product (I-P) matrix before and after clustering, (c) shows the final clustering 
result, and (d) shows the histogram of cluster number K obtained from 4000 MCMC samples of B. From 
the result, one can see that our algorithm clusters these 100 shapes well other than splitting one class. 

In next experiment, we analyze the sensitivity of the cluster number K to the parameter d, degrees of 
freedom of the Wishart distribution. Note that in the Euclidean case, d can be easily estimated since d (in 
the case of d < n) is the dimension of the data. Eig. [T^ shows the estimated cluster number K versus the 
value of parameter d in the dataset shown in Eig. 
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It is evident that the estimates of K are robust to 

different choices of d. 

To compare with existing methods in the shape domain, we test our method on another subset of MPEG- 
7 dataset that was used in ( Bicego and Muring |2004t Bicego et al.[ 2004} Bicego and Murino 20071. The 
dataset contains 6 classes of shapes with 20 shapes per class. To quantify the clustering result, we use the 
“classification rate” defined in (Jain and Dubes[ 19881. Eor each cluster, we note the predominant shape 
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Figure 10: Clustering with DPM of Gaussians and MFM of Gaussians. The left panel shows the results from DPM of Gaussians 
and the right panel shows the results form MFM of Gaussians. 


class, and for those shapes assigned to the cluster which do not belong to the dominant class are recognized 
to be misclassified. The classification rate is the total number of dominant shapes for all classes divided by 
the total number of shapes. However, this measure is known to be sensitive towards larger clusters. The 
Rand index ( Torsello et aL[[2007] ) is an alternative measure of the quality of classification which measures 
the similarity between the clustering result and the ground truth, defined as RI = 0 /( 2 )- Here a is fhe 
number of fhe “agreements” between the clustering and the ground truth, which is defined as fhe sum of 
fwo quanfifies: (1) fhe number of pairs of elemenfs belonging fo fhe same class fhaf are assigned fo fhe 
same clusfer; (2) fhe number of pairs of elemenfs belonging fo differenf sefs fhaf are assigned fo differenf 
classes. If fhe clusfering resulf is fhe same as fhe ground trufh, RI = 1, ofherwise RI < 1. The Rand 
index penalizes fhe over-segmenfafion while fhe classification rale does nol. Table [^compares fhe overall 
classificafion rafe and Rand index of our mefhod wifh ofher mefhods, such as Fourier descripfor combined 
wifh supporf veclor machine based classification (FD + SVM), hidden Markov model (HMM + Wfl) wifh 
weighfed likelihood classificafion (Bicego and Murino 20071, HMM wifh OPC approach (HMM + OPC) 
(Bicego el al. 20041, elastic shape analysis (ESA) (Srivaslava el al. 20111 wifh k-medians (K-medians), 
ESA wifh pairwise clustering mefhod (ESA + PW) (Srivaslava el al. 20051. Our model, wifh Wisharl-CRP 
applied on fhe elastic inner producf (EIP) mafrix is denoled by EIP + W-CRP. The classificafion rafe, Rand 
index and fhe compufalional lime of K-medians, ESA - 1 - PW and our mefhod are obfained based on fhe 
average of 5 runs on a laptop wifh a i5-2450M CPU and 8GB memory. The compufalional time of our 
approach (EIP - 1 - DW) includes the cost of calculating the inner product matrix S (642.6 s) and generating 
the 4000 MCMC samples (131.6 s). An faster approach for calculating the elastic inner product matrix 
defined in our paper is available in ( [Huang et al.[ 20141. Eor ESA - 1 - PW and K-medians method, we set 
K = 6 since we know the true K in this case. The classification rates for ED-i- SVM, HMM - 1 - Wtl, and 
HMM -I- OPC are reported from ([Bicego and Murino 20071 , and these rates are based on the 1-nearest 
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Figure 11: Clustering process for 100 shapes. The histogram shows the posterior distrihution of the cluster number k obtained 
from 4000 MCMC samplings of B without any burn-in. 



Figure 12; Clustering sensitivity analysis of parameter d in the dataset shown in Fig. 
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Table 2: Comparison of the classification rate on MPEG-7 dataset 


Classifier 

PD-b 

SVM 

HMM-b 

WtE 

HMM-b 

OPC 

K- 

medians 

ESA-b 

PW 

ElP-b 

W-CRP 

Classification rate (%) 

94.29 

96.43 

97.4 

81.5 

96.67 

100.00 

Rand index 

- 

- 

- 

0.91 

0.98 

1.00 

Time (seconds) 

- 

- 

- 

648.5 

707.4 

774.2 


neighbor classification. As evident from the results, our model can automatically find the cluster number 
K = 6, and the classification rate is better than the competitors. 

4.2. Real Data Study 

In this section, we cluster cell and protein shapes introduced in Section 2. Automated clustering is a 
crucial goal in real data applications where it is hard to provide a rough estimate of the number of clusters 
visually. In the examples where the ground truth labels are available, our methods provide higher classifi¬ 
cation rates. 

Cell shapes clustering: We first cluster the cell shape data introduced in Section 2.2. The cell shape dataset 
includes cell shapes from different cell types: DLEX-p46 cells, and NIH-3T3 cells. DLEX-p46 cells are 
round, whereas NIH-3T3 cells have an elongated, spindly appearance. 

In the first experiment, we select 100 shapes from NIH-3T3 cell shapes to form a subset, and cluster 
them with different priors on 6: a set of small 9 {6 € {0.2,0.4,0.6, 0.8,1}) and a set of large 9 (9 € 


(200,400, 600,800,1000}). Eig. 13 shows the clustering result. The first row shows the clustering result 
with a small 9 and the second row shows the result with a big 9. One can see that with a small 9, our method 
cluster the data into 2 classes: the first class of shapes only have two corners, and the second class has three 
or more comers. Eor a large 9, we cluster the shapes into three classes: one with shapes have two comers, 
one with shapes have four or more comers and one with shapes have three comers. 

We pooled together 100 shapes of DEEX-P46 cells and 100 shapes of NIH-3T3 cells to form a set of 
200 cells. Our goal is to cluster this pooled dataset into two classes: one with the DEEX-p46 cell shapes 
and the other with NIH-3T3 cell shapes. Since we expect a smaller number of clusters here, we use a set of 
small 9, 9 £ {0.1,0.2,0.3, 0.4}, ^ = 1 as our model parameters. The clustering result are shown in Eig. 14 
first row, where each plot is a cluster. The 200 cell shapes are automatically clustered into 3 classes. Cell 
shapes of DEEX-p46 are plotted with thick blue lines, and NIH-3T3 cells are plotted with thin pink lines. 
The NIH-3T3 cell shapes have a large variance, thus our method separates them into 2 classes, one with 
long strip shapes, and the other with star shapes. Setting iT = 2, we compared with ESA -i- PW method in 
the left panel of the bottom row and the K-medians method in the right. Prom the result, one can see that 
our method identifies 3 meaningful clusters instead of 2, and the clustering quality (both the classification 
rate and Rand index) also is higher than ESA -i- PW and K-median method. The classification rates for our 
method, ESA-i-PW and K-medians are 96%, 83.5%, and 81.5%, respectively, and the Rand indexes are 0.82, 
0.73, and 0.69, respectively. 

Protein structure data clustering: In the following experiments, we will use our model to cluster the 
protein structure data introduced in Section 2.1. 

In the first experiment, we choose a small protein structure dataset obtained from SCOP with only 
88 proteins. Based on SCOP, these proteins are from 4 classes (SCOP provides the ground truth). Those 


proteins are pre-processed similar to an earlier study (Eiu et al. 2011 1 . To have a good estimate of the SRVPs 
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Figure 13: Clustering process for NIH-3T3 cell shapes with different 9 values. 
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Figure 14: Clustering result for a subset of cell shapes containing both DLEX-P46 (thick blue) and NIFI-3T3 (thin pink). The first 
row is the result of our method. The bottom left panel shows the result of ESA + PW and the bottom right shows the result of 
K-medians. 
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Figure 15: Protein structure classification on a SCOP subset containing 88 proteins. The first row shows the inner product matrix 
between protein structures after clustering and the histogram of cluster number k. The second row shows the four clusters. 


from the raw data, we smooth the resampled protein structures with a Gaussian kernel. We also added one 
residue at both N and C terminal of each protein chain by extrapolating from the two terminal residues 
to allow some degrees of freedom on matching boundary residues. The added residues are removed after 
matching. Note that these smoothed SRVFs will only be used for searching optimal re-parameterizations 
7 and rotations SO(3) to get the inner product between protein structures. Then we apply our mixture 
of Wisharts model to the inner product matrix S G C/+ (M) and get the clustering result, where we use 
parameters 6 G {0.1, 0.2, 0.3,0.4} and ^ = 1. The final clustering results are shown in Fig. 15 The 
clustering rate is 100% compare with the ground truth provided by SCOP. 

In next experiment, we choose 20 classes with at least 10 elements in each class from SCOP dataset to 
form a subset with 602 proteins. The final clusfering resulf shows some clusfers wifh only a few elemenfs 
which we consider as outliers. In fhis experimenf, our model idenfifies 17 oufliers (7 small clusfers). Affer 
removing fhese oufliers, fhe remaining 585 profeins are clusfered info 38 classes. The clusfering rafe is 
84.1%. The firsl row in Fig. 16 shows fhe inner producf mafrix corresponding fo fhe 585 profein sfrucfures 
(after puffing elemenfs in fhe same clusfer fogefher), and fhe posterior esfimafe of fhe parfifion mafrix 
B. The second row shows firsf four clusfers of fhe clusfering resulf affer fhe alignmenf (removing shape¬ 
preserving fransformafions). One can see fhaf inside each clusfer, fhe shapes of fhese profein sfrucfures are 
very similar fo each ofher. As comparisons, we remove fhe oufliers defected by our mefhod, fhen apply ESA 
+ PW and K-medians mefhod fo clusfer fhe leff 585 profeins by seffing K = 20. ESA -i- PW gefs 75.99% of 
classificafion rafe and K-medians gefs 63.42%. The Rand indexes for our model, ESA -i- PW and K-medians 
are 0.95,0.93, and 0.91 respectively. As evidenf, we obfain a good clustering resulf based on only fhe shape 
of fhe profeins. 
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Figure 16: Protein structure classification on a SCOP subset of 602 proteins. The first row shows the inner product matrix between 
protein structures after clustering and the corresponding inferred B, respectively. The second row shows the first four clusters. 


5. Conclusion 

We have presented a Bayesian approach for clustering of shape data that does not require the number of 
clusters a priori. Instead, it assumes a flexible prior on the space of data partitions and studies the resulting 
posterior distribution on the clustering configuration. This prior is derived from a Dirichlet process (realized 
using the Chinese restaurant process) and the likelihood is given by the Wishart distribution. The Bayesian 
inference provides a reasonable solution for each of the simulated and real shape datasets studied in this 
paper. This fully automated method provides a way for shape-based partitioning of large datasets even in 
situations where visualization-based tools are not feasible. 
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